Detection and mitigation for solid-state storage device read failures due to weak erase

ABSTRACT

Weak erase detection and mitigation techniques are provided that detect permanent failures in solid-state storage devices. One exemplary method comprises obtaining an erase fail bits metric for a solid-state storage device; and detecting a permanent failure in at least a portion of the solid-state storage device causing weak erase failure mode by comparing the erase fail bit metric to a predefined fail bits threshold. In at least one embodiment, the method also comprises mitigating for the permanent failure causing the weak erase failure mode for one or more cells of the solid-state storage device. The mitigating for the permanent failure comprises, for example, changing a status of the one or more cells to a defective state and/or a retired state. The detection of the permanent failure causing the weak erase failure mode comprises, for example, detecting the weak erase failure mode without an erase failure.

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application claims priority to U.S. Provisional Patent Application Ser. No. 62/808,006, filed Feb. 20, 2019, entitled “Detection and Mitigation for Solid State Storage Device Read Failures Due to Weak Erase,” incorporated by reference herein in its entirety.

SUMMARY

In one embodiment, a method comprises obtaining an erase fail bits metric for a solid-state storage device; and detecting a permanent failure in at least a portion of the solid-state storage device causing weak erase failure mode by comparing the erase fail bit metric to a predefined fail bits threshold. In at least one embodiment, the method also comprises the step of mitigating for the permanent failure causing the weak erase failure mode for one or more cells of the solid-state storage device.

In some embodiments, the mitigating for the permanent failure comprises changing a status of the one or more cells of the solid-state storage device to a defective state and/or a retired state, for example, at a page level resolution. The detection of the permanent failure causing the weak erase failure mode comprises, for example, detecting the weak erase failure mode without an erase failure.

Other illustrative embodiments include, without limitation, apparatus, systems, controllers, methods and computer program products comprising processor-readable storage media.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of an illustrative solid-state storage system, in accordance with one or more embodiments of the present disclosure;

FIG. 2 illustrates a flash channel read path with read reference voltage tracking, in accordance with some embodiments of the present disclosure;

FIG. 3 is a graph of cell voltage distributions for a normal hard decision read operation in a solid-state memory device, in accordance with some embodiments of the present disclosure;

FIG. 4 illustrates bit counts as a function of the threshold voltage distributions for exemplary pages exhibiting “weak erase” symptoms, according to some embodiments;

FIG. 5 is a flow chart illustrating an exemplary implementation of a weak erase detection and mitigation process, according to one embodiment of the disclosure;

FIG. 6 illustrates a number of pages as a function of the erase fail bits distribution for detecting a weak erase condition, according to some embodiments;

FIG. 7 is a flow chart illustrating an exemplary implementation of an infant mortality screening process, according to at least one embodiment of the disclosure;

FIG. 8 is a flow chart illustrating an exemplary implementation of a wear-induced weak erase mitigation process, according to one embodiment of the disclosure;

FIG. 9 is a flow chart illustrating an exemplary implementation of a double erase fix mitigation process, according to an embodiment;

FIG. 10A illustrates a voltage distribution associated with an erase verify fix mitigation technique, according to some embodiments;

FIG. 10B is a flow chart illustrating an exemplary implementation of an erase verify fix mitigation process, according to one embodiment of the disclosure;

FIG. 11 illustrates voltage distributions associated with a page calibration mitigation technique for read failures, for a good erase operation and a weak erase operation, respectively, according to one or more embodiments of the disclosure;

FIG. 12 illustrates an exemplary technique for determining erase fail bits thresholds, according to some embodiments;

FIG. 13 illustrates exemplary pseudo code for a process for determining the erase fail bits thresholds in accordance with the techniques of FIG. 12, according to one embodiment of the disclosure; and

FIG. 14 illustrates a processing platform that may be used to implement at least a portion of one or more embodiments of the disclosure.

DETAILED DESCRIPTION

Illustrative embodiments will be described herein with reference to exemplary solid-state storage devices and associated storage media, controllers, and other processing devices. It is to be appreciated, however, that these and other embodiments are not restricted to the particular illustrative system and device configurations shown. Accordingly, the term “solid-state storage device” as used herein is intended to be broadly construed, so as to encompass, for example, any storage device implementing the weak erase detection and mitigation techniques described herein. Numerous other types of storage systems are also encompassed by the term “solid-state storage device” as that term is broadly used herein.

In one or more embodiments, weak erase detection and mitigation techniques are provided. “Weak Erase” (also referred to as “Shallow Erase”) is a common failure mode for NAND Flash devices (2D/3D), and other solid-state memory devices, which may or may not result in erase or program errors. Weak erase without an erase/program error can result in an Uncorrectable Error Correction Code (UECC) read error at the system level. Thus, techniques are needed to detect such weak erase errors even when there is not a program or erase error.

In some embodiments, the disclosed weak erase detection and mitigation techniques detect permanent failures in at least portions of the solid-state storage device causing weak erase failures each time that portion of the solid-state storage device is erased by comparing an erase fail bit metric to a predefined fail bits threshold. The phrase “permanent failure” shall be broadly construed and comprises, for example, a failure caused by a failure mode that does not disappear across erase cycles (and hence is expected to be repeatable across program/erase cycles). In addition, a permanent failure will repeat on the same location (e.g., a die, block, or page) across program/erase cycles.

In addition, the permanent failure causing the weak erase failure mode is mitigated for one or more cells of the solid-state storage device. For example, the mitigation for the permanent failure comprises changing a status of the one or more cells of the solid-state storage device to a defective state and/or a retired state. In a further variation, the permanent failure in the solid-state storage device causing the weak erase failure mode is detected without an erase failure.

The weak erase can cause an error in the data stored by the cell, and negatively impact the performance of the NAND Flash memory device.

In addition, the Uncorrectable Bit Error Rate (UBER) of a NAND memory device can be impacted due to a “weak erase” failure mode, if the weak erase condition is not handled at NAND level or system level using the disclosed weak erase detection and mitigation techniques. One or more aspects of the present disclosure focus on “weak” erase failures caused by a permanent failure mode, such as “Leakage Current,” or a “Physical Random Defect,” as opposed to a transient failure mode caused due to failure modes such as “Open Block,” “Disturb,” or “Retention.” A permanent failure mode persists on or after a subsequent erase operation, whereas a transient failure mode disappears after an erase operation.

Weak erase can be caused by, for example:

NAND flash design for life extension, causing a systematic impact;

a physical defect on the NAND device, causing an impact specific to a die, block and/or page; and/or

NAND “internal” leakage during an erase operation, causing an impact that is specific to a die, block and/or page (or pages; in many cases two subsequent pages in a physical wordline).

Thus, detecting weak erase and applying a corresponding system level mitigation using the disclosed techniques is important for preventing data loss.

FIG. 1 is a schematic block diagram of an illustrative solid-state storage system 100. As shown in FIG. 1, the illustrative solid-state storage system 100 comprises a solid-state storage control system 110 and a solid-state storage media 150. The exemplary solid-state storage control system 110 comprises a controller 120 and an encoder/decoder block 130. In an alternative embodiment, the encoder/decoder block 130 may be implemented inside the controller 120.

As shown in FIG. 1, the controller 120 comprises a weak erase detection and mitigation process 500, discussed below in conjunction with FIG. 5, to implement the weak erase detection and mitigation techniques described herein. The encoder/decoder block 130 may be implemented, for example, using well-known commercially available techniques and/or products. The encoder within the encoder/decoder block 130 may implement, for example, error correction encoding, such as a low-density parity-check (LDPC) encoding. The decoder within the encoder/decoder block 130 may be embodied, for example, as a hard decision decoder, such as a hard decision low-density parity-check (HLDPC) decoder.

The solid-state storage media 150 comprises a memory array, such as a single-level or multi-level cell flash memory, a NAND flash memory, a phase-change memory (PCM), a magneto-resistive random access memory (MRAM), a nano RAM (NRAM), a NOR (Not OR) flash memory, a dynamic RAM (DRAM) or another non-volatile memory (NVM). While the disclosure is illustrated primarily in the context of a solid-state storage device (SSD), the disclosed weak erase detection and mitigation techniques can be applied in solid-state hybrid drives (SSHD) and other storage devices, as would be apparent to a person of ordinary skill in the art based on the present disclosure.

FIG. 2 illustrates a flash channel read path 200 with channel tracking-based read retry voltage adjustment in accordance with some embodiments of the present disclosure. The read path 200 includes a flash device 202 having an array of memory cells, or another type of non-volatile memory. Based upon the disclosure provided herein, one of ordinary skill in the art will recognize a variety of storage technologies that can benefit from the weak erase detection and mitigation techniques disclosed herein.

Read reference voltages 226 are applied to the flash device 202 by a read control device 224 in a series of N reads. Each memory cell is read N times, and the N reads result in read data 204 containing N bits per memory cell as a quantized version of the stored voltage on the memory cell. The read data 204 is buffered in a read buffer 206, and buffered read data 210 from read buffer 206 is provided to a log likelihood ratio (LLR) generation circuit 212 (or likelihood generator, which can also be adapted to use plain likelihood values). The N bits for a memory cell are mapped to log likelihood ratios 214 for the memory cell in log likelihood ratio generation circuit 212. In some embodiments, the log likelihood ratio generation circuit 212 contains a lookup table that maps the read patterns in buffered read data 210 to log likelihood ratios 214.

A tracking module 230 receives the buffered read data 210 from the read buffer 206, or from any other suitable source. Generally, channel tracking techniques adapt to the changes in read reference voltages to maintain a desired performance level. Adaptive tracking algorithms typically track variations in the solid-state storage channel and consequently, help to maintain a set of updated channel parameters. The updated channel parameters are used, for example, to adjust read reference voltages. United States Published Patent Application No. 2013/0343131, filed Jun. 26, 2012, entitled “Fast Tracking for Flash Channels,” and/or United States Published Patent Application No. 2015/0287453, entitled “Optimization of Read Thresholds for Non-Volatile Memory,” (now U.S. Pat. No. 9,595,320) incorporated by reference herein in their entirety, disclose techniques for adapting read reference voltages.

The tracking module 230 identifies the intersection point between neighboring voltage distributions for a memory cell, in a known manner, and provides read reference voltage level V_(REF0) 232, including the read reference voltage V_(REF0) corresponding to the intersection. When the read reference voltage V_(REF0) corresponding to the intersection is used for the soft read operation, it will result in a reduction in the bit error rate. The read reference voltage V_(REF0) is used in some embodiments as the first read reference voltage of a read retry operation, with additional read reference voltages around V_(REF0) used to obtain substantially all possible log likelihood ratio values. The tracking module 230 thus generates the read reference voltage level V_(REF0) 232 to be used in read retry operations. In other embodiments, V_(REF0) may not correspond to the intersection of the distributions depending on the tracking algorithm design, tracking inaccuracy, or the actual channel distributions deviating from Gaussian behavior in either the peak or the tail. In other situations, V_(REF0) may coincide with the intersection of the distributions but may not be applied first, and that would be accounted for in the calculations in 212 and 224.

The tracking module 230 also tracks the voltage distributions 234. In some embodiments, the tracking module 230 calculates the voltage distribution means and variances for each voltage distribution 234 corresponding to each possible state in each memory cell. The voltage distributions 234 can be calculated in any suitable manner based on the read data. As an example, the tracking module 230 can operate as disclosed in U.S. Published Patent Application No. 2013/0343131, filed Jun. 26, 2012, entitled “Fast Tracking for Flash Channels,” incorporated by reference herein in its entirety. In some embodiments, the tracking module 230 tracks intersections without estimating means or variances.

For a two-state memory cell, or single-level memory cell, the tracking module 230 estimates the means and variances of the voltage distributions of states “1” and “0”, as well as the read reference voltage V_(REF0) that most reduces the bit error rate and which likely lies at the intersection of those distributions, in a known manner.

The tracking module 230 provides the voltage distributions 234 to the log likelihood ratio generation circuit 212 for use in updating the log likelihood ratio lookup table. The log likelihood ratio generation circuit 212 is used to calculate likelihood ratios 214 for decoding by an LDPC (low-density parity-check) decoder 216 that generates decoded data 220. The log likelihood ratio generation circuit 212 also determines where to place the other N−1 read reference voltages around V_(REF0) 232 based on the voltage distributions 234 and on the read reference voltage V_(REF0) 232 to obtain substantially all possible log likelihood ratio values when the read patterns in buffered read data 210 are mapped to log likelihood ratios. The log likelihood ratio generation circuit 212 determines where to place the other N−1 read reference voltages around V_(REF0) 232, updates the lookup table, and provides the N−1 read reference voltage levels 222 to a read controller 224. It is important to note that the division of functionality is not limited to the example embodiments disclosed herein. For example, in other embodiments, the tracking module 230 calculates and provides read reference voltages around V_(REF0) 232 and provides those voltages to the log likelihood ratio generation circuit 212, rather than the log likelihood ratio generation circuit 212 determining where to place the other N−1 read reference voltages around V_(REF0) 232, and these divisions of functionality are to be seen as equivalent.

The read reference voltages are stored in log likelihood ratio generation circuit 212 in some embodiments, as calculated based on the log likelihood ratio lookup table in log likelihood ratio generation circuit 212 and on the voltage distribution means and variances 234 from tracking module 230.

The read controller 224 controls read retry operations in the flash device 202, providing each of the N read reference voltages (including V_(REF0) 232) to be used when reading the memory cells in the flash device 202. The read controller 224 initiates N reads of a page, with the first read using read reference voltage V_(REF0) in some embodiments, and with the subsequent N−1 reads at read reference voltages around V_(REF0) as determined by log likelihood ratio generation circuit 212.

FIG. 3 is a graph 300 of cell voltage distributions 311 through 318 for a normal hard decision read operation in a TLC flash memory device, in accordance with some embodiments of the present disclosure. The exemplary TLC flash memory device is a BiCS3 NAND flash memory from Toshiba Memory America, Inc. Other flash memory types and architectures can be employed, as would be apparent to a person of ordinary skill in the art, without departing from the present invention. The resulting voltages read from the memory cell thus appear something like the distributions 311-318 shown in the graph 300 of FIG. 3, rather than eight distinct discrete voltage levels corresponding to the eight states 111, 110, 100, 000, 010, 011, 001, 101 at the corresponding target state voltage levels. Each distribution 311-318 will have a mean roughly equal to the target voltage for the respective state, and the variance will depend upon the noise. Because the voltages on the memory cell are not accurate, the voltages read back can vary according to the distributions 311-318. In some embodiments, during the initial read of the memory cell, reference voltages R_(i) (i=1, 2, . . . , 7) (e.g., R₁ through R₇) are used during a read to determine the state of the memory cell, returning hard decisions about the state of the memory cell.

For example, in general, if the read voltage is below reference voltage R₁, a decision indicates that the memory cell is determined to be in state 111. If the read voltage is above reference voltage R₁ and below reference voltage R₂, a decision indicates that the memory cell is determined to be in state 110, and so on.

The first, second, and third bits in a given state are often referred to as the most significant bits, center significant bits, and least significant bits (MSB, CSB, LSB), respectively. In some embodiments, the read operation is divided into a process of reading LSB pages, center significant bit (CSB) pages and most significant bit (MSB) pages. States 111, 011, 001 and 101, for example, correspond to a least significant bit value of 1, and states 110, 100, 000 and 010 correspond to a least significant bit value of 0. When reading the least significant bit, for example, the reference voltages R₁ and R₅ are applied to the memory cell to obtain the least significant bit, for the exemplary Toshiba TLC NAND device.
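
As an illustrative sketch only, the hard decision read described above can be expressed in Python. The function names hard_decision and read_lsb, and any reference voltage values passed to them, are assumptions introduced for illustration; only the state ordering and the use of R₁ and R₅ for the least significant bit are taken from the description above.

```python
from bisect import bisect_right

# TLC states in order of increasing threshold voltage, as listed for FIG. 3.
STATES = ["111", "110", "100", "000", "010", "011", "001", "101"]


def hard_decision(cell_voltage, refs):
    """Map a cell voltage to a 3-bit state using reference voltages R1..R7.

    `refs` must hold the seven reference voltages in ascending order.
    A voltage below R1 yields state 111, between R1 and R2 yields 110, etc.
    """
    if len(refs) != 7:
        raise ValueError("expected seven reference voltages R1..R7")
    return STATES[bisect_right(refs, cell_voltage)]


def read_lsb(cell_voltage, refs):
    """Recover only the least significant bit using reads at R1 and R5.

    The LSB is 1 below R1 (state 111) and at or above R5 (states 011,
    001, 101), and 0 in between (states 110, 100, 000, 010).
    """
    r1, r5 = refs[0], refs[4]
    return 1 if (cell_voltage < r1 or cell_voltage >= r5) else 0
```

Because a weak erase shifts the erased distribution upward toward R₁, a sketch of this kind also shows why the page type read at R₁ is the one most exposed to the overlap between the erased state and its neighbor.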

While FIG. 3 illustrates the cell voltage distributions for a TLC flash memory, the disclosed read parameter prediction techniques can be applied to SLC, MLC, QLC, etc. and other flash memory systems, as would be apparent to a person of ordinary skill in the art.

For each R_(i) shown in FIG. 3, a pair (V_(oi), P_(i)) can be derived for each page number, where V_(oi) is the substantially optimal reference voltage that substantially minimizes the bit error rate, and P_(i) is the corresponding page number.

Weak Erase Detection and Mitigation Techniques

FIG. 4 illustrates bit counts as a function of the threshold voltage distributions 400 for exemplary pages exhibiting “weak erase” symptoms, according to some embodiments. As shown in FIG. 4, the weak erase condition results in overlapped States for States L0 (Erased) and L1 (Programmed), for example. The weak erase condition may not result in erase or program errors. Typically, the weak erase failure manifests itself as UECC read errors on LSB pages (or whichever Page Type corresponds to a Read level between an Erase state and an adjacent Program state). The term “LSB page(s),” as used herein, shall be broadly construed to refer to a Page Type corresponding to a Read level between an Erase state and an adjacent Program state, depending on the flash type of a given NAND device.

FIG. 5 is a flow chart illustrating an exemplary implementation of a weak erase detection and mitigation process 500, according to one embodiment of the disclosure. As shown in FIG. 5, the exemplary weak erase detection and mitigation process 500 initially performs a weak erase detection during step 510. Generally, the weak erase detection is performed prior to a program and/or read operation or after a first occurrence of a read failure requiring an outer code (OCR). In some embodiments, an Erase Fail Bits (EFB) per page metric is used as an indicator of a weak erase condition, as discussed further below in conjunction with FIG. 6.

In one or more embodiments, the exemplary EFB threshold is determined, as follows:

EFB=Erase Fail Bits (e.g., using FIG. 6, discussed below); and

PFB=Programmed Fail Bits (e.g., raw bit flips, that is, bits read in error following programming).

First, PFB is determined:

PFB=f (EFB) (e.g., correlation, as discussed further below in conjunction with FIG. 12) for every NAND type during NAND Testing which needs to include Weak Erase samples.

Then, the EFB threshold can be determined based on an acceptable PFB based on available ECC and screening/defecting criterion, as would be apparent to a person of ordinary skill in the art.
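
One way to picture this two-step determination is the following minimal Python sketch. It assumes, purely for illustration, that the correlation PFB=f(EFB) is well approximated by a straight line fitted to test samples and that the acceptable PFB is the ECC capability less a safety margin; the description does not specify the form of f, and the function names fit_linear and efb_threshold are hypothetical.

```python
def fit_linear(efb_samples, pfb_samples):
    """Least-squares fit of PFB = slope * EFB + intercept.

    ASSUMPTION: a linear model stands in for the correlation f; the
    actual transfer function is characterized per NAND type in testing.
    """
    n = len(efb_samples)
    if n < 2 or n != len(pfb_samples):
        raise ValueError("need at least two matched (EFB, PFB) samples")
    mean_x = sum(efb_samples) / n
    mean_y = sum(pfb_samples) / n
    sxx = sum((x - mean_x) ** 2 for x in efb_samples)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(efb_samples, pfb_samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def efb_threshold(slope, intercept, ecc_capability, ecc_margin):
    """Invert the fitted line at the largest acceptable PFB.

    ASSUMPTION: acceptable PFB = ECC capability minus a required margin.
    """
    if slope <= 0:
        raise ValueError("expected PFB to increase with EFB")
    acceptable_pfb = ecc_capability - ecc_margin
    return (acceptable_pfb - intercept) / slope
```

Under these assumptions, pages whose erase fail bits exceed the returned value would be expected to produce more programmed fail bits than the ECC can correct with the chosen margin.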

During step 520, the exemplary weak erase detection and mitigation process 500 implements an infant mortality screening process 700, as discussed further below in conjunction with FIG. 7, to perform the factory screening for “infant mortality” failures. Generally, if the erase fail bits per page exceeds a predefined factory_erase_fail_bits_threshold, then the block is defected.

During step 530, the exemplary weak erase detection and mitigation process 500 implements a wear-induced weak erase mitigation process 800, as discussed further below in conjunction with FIG. 8, to apply a retirement policy for “wear” induced failures. Generally, upon a read error, the exemplary wear-induced weak erase mitigation process 800 determines whether the erase fail bits per page exceeds a predefined page_defect_erase_fail_bits_threshold and, if so, defects the page/pages. It is noted that defecting/retiring SSD cells provides a “permanent” mitigation by prohibiting any future use of the defected or retired SSD cells.

During step 540, the exemplary weak erase detection and mitigation process 500 implements a double erase fix mitigation process 900, as discussed further below in conjunction with FIG. 9, to apply a double erase fix for a weak erase failure. Generally, if the erase fail bits per page exceeds a predefined double_erase_fail_bits_threshold, then a second erase operation is performed to allow a deeper erase. As discussed further below, the double erase fix mitigation process 900 can optionally perform a permanent mitigation by marking a block to have a double erase operation performed on every erase (and then every future erase operation uses a double erase).

During step 550, the exemplary weak erase detection and mitigation process 500 implements an erase verify fix, as discussed further below in conjunction with FIGS. 10A and 10B, whereby the erase verify voltage is shifted in a negative direction using Test Modes to apply additional Erase Pulse(s) causing a deeper erase. As discussed further below, an erase verify fix mitigation process 1010 can optionally perform a permanent mitigation by marking a block to perform an updated erase verify on every erase operation (and then every future erase operation uses an updated erase verify).

Finally, during step 560, the exemplary weak erase detection and mitigation process 500 implements a page calibration for read failures. Generally, the page calibration performs an “R1” Read Level Calibration using a predetermined Vref Sweep Window on LSB page (or whichever Page Type corresponds to a Read level between an Erase state and an adjacent Program state, for different flash memory types) failures, as discussed further below in conjunction with FIG. 11.

Weak Erase Detection

FIG. 6 illustrates a number of pages as a function of the EFB distribution for detecting a weak erase condition, according to some embodiments. Generally, FIG. 6 illustrates passing pages compared to UECC failing pages. It has been found that weak erase impacted pages demonstrated significantly higher fail bits after an erase (e.g., prior to program) (referred to as erase fail bits), for example, on LSB Pages (or other page types impacted by the weak erase condition, such as whichever Page Type corresponds to a Read level between an Erase state and an adjacent Program state), when compared with pages without Weak Erase. It was found that all UECC failures were due to weak erase in the tested samples.

One or more aspects of the disclosure recognize that erase fail bits on LSB Pages can be used for detecting a weak erase prior to programming.

Weak Erase Mitigation

FIG. 7 is a flow chart illustrating an exemplary implementation of an infant mortality screening process 700, according to at least one embodiment of the disclosure. As shown in FIG. 7, the exemplary infant mortality screening process 700 initially completes a NAND factory test during step 710 with program erase (PE) equal to “x” cycles. Thereafter, during step 720, a one pass erase/program is performed on all blocks of the SSD, and all blocks are then erased during step 730. The exemplary infant mortality screening process 700 checks the erase fail bits/page on LSB pages (for the exemplary NAND flash device type) during step 740. For other NAND types, the erase fail bits/page would be checked on pages corresponding to a read state adjacent to the erase state (e.g., MSB or CSB pages for different NAND types).

A test is performed during step 750 to determine if the fail bits/page exceeds a predefined infant_erase_fail_bits threshold. When it is determined during step 750 that the fail bits/page exceeds the predefined infant_erase_fail_bits threshold, then the block is defected during step 760. It is noted that the defect mitigation is applied at the block level during step 760, for example, before any user data is programmed to a flash location. In this manner, data loss is prevented.
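
The screening decision of steps 740 through 760 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name screen_infant_mortality and the dictionary layout of per-page erase fail bit counts are hypothetical, and the sketch assumes the factory test, the one pass erase/program, and the final erase of steps 710 through 730 have already been performed.

```python
def screen_infant_mortality(erase_fail_bits_by_block,
                            infant_erase_fail_bits_threshold):
    """Return the set of blocks to defect (steps 740-760).

    `erase_fail_bits_by_block` maps a block identifier to the list of
    erase fail bit counts measured on that block's LSB pages (or the
    equivalent page type for other NAND types) after the final erase.
    HYPOTHETICAL data layout, chosen only for illustration.
    """
    defected = set()
    for block, page_counts in erase_fail_bits_by_block.items():
        # Step 750: compare the fail bits of each page to the threshold.
        if any(count > infant_erase_fail_bits_threshold
               for count in page_counts):
            # Step 760: defect the whole block before any user data
            # is programmed to it.
            defected.add(block)
    return defected
```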

FIG. 8 is a flow chart illustrating an exemplary implementation of a wear-induced weak erase mitigation process 800, according to at least one embodiment of the disclosure. As shown in FIG. 8, the exemplary wear-induced weak erase mitigation process 800 initially performs a test during step 810 to determine if the OCR triggered on LSB Page number “N”. When it is determined during step 810 that the OCR triggered on LSB Page number “N,” the failing block is added to an Erase Suspect Scan List during step 820. It is noted that the failing page number is N, and N+Y and N−Y are two adjacent pages of the same page type as the failing page (where the value of “Y” depends on the particular 3D Flash Block Architecture). For example, for a failing LSB page, two adjacent LSB pages are also used. While LSB pages are considered in the exemplary embodiments, other Page Types may be impacted that correspond to a Read level between an Erase state and an adjacent Program state, as would be apparent to a person of ordinary skill in the art.

During step 830, the failing block is erased (ensuring that the failing block or Page numbers N, N+Y and N−Y are programmed prior to erasing). The erase fail bits are checked during step 840 on Page numbers N, N+Y and N−Y.

A test is performed during step 850 to determine if the fail bits/page exceeds a predefined erase_fail_bits_page_defect_threshold. If it is determined during step 850 that the fail bits/page exceeds the predefined erase_fail_bits_page_defect_threshold, then the failing LSB pages (or other appropriate page types for different NAND types, as described herein) are retired during step 870.

If, however, it is determined during step 850 that the fail bits/page does not exceed the predefined erase_fail_bits_page_defect_threshold, then the block is removed from the erase suspect scan list during step 860.

In this manner, defect mitigation is applied at the page level after a first occurrence of error, before a subsequent program, thereby preventing future data loss.
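
The flow of steps 820 through 870 can be sketched in Python as shown below, under several stated assumptions. The function name mitigate_wear_induced and the callables erase_block and erase_fail_bits are hypothetical stand-ins for controller operations; the sketch assumes that pages N−Y and N+Y exist in the block and that the caller has already confirmed that the OCR triggered on page N (step 810) and that the pages were programmed prior to erasing.

```python
def mitigate_wear_induced(block_id, n, y, erase_block, erase_fail_bits,
                          erase_fail_bits_page_defect_threshold,
                          suspect_list):
    """Return the sorted list of page numbers to retire for one block.

    HYPOTHETICAL interface: erase_block(block_id) erases the block and
    erase_fail_bits(block_id, page) returns the erase fail bit count
    measured on that page after the erase.
    """
    suspect_list.add(block_id)        # step 820: Erase Suspect Scan List
    erase_block(block_id)             # step 830: erase the failing block
    pages = (n - y, n, n + y)         # failing page and same-type neighbors
    counts = {p: erase_fail_bits(block_id, p) for p in pages}   # step 840
    retired = sorted(p for p, c in counts.items()               # step 850
                     if c > erase_fail_bits_page_defect_threshold)
    if not retired:
        suspect_list.discard(block_id)                          # step 860
    # Step 870: the caller retires the returned pages. The description
    # does not say what happens to the list entry in that case, so this
    # sketch leaves the block on the suspect list.
    return retired
```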

FIG. 9 is a flow chart illustrating an exemplary implementation of a double erase fix mitigation process 900, according to an embodiment. As shown in FIG. 9, the exemplary double erase fix mitigation process 900 initially tests to determine if the OCR triggered on LSB page number N (for the exemplary NAND flash device type; or other pages susceptible to weak erase errors in other NAND device types), during step 910. While LSB pages are considered in the exemplary embodiments, other Page Types may be impacted that correspond to a Read level between an Erase state and an adjacent Program state, as would be apparent to a person of ordinary skill in the art.

When the OCR triggers on LSB Page number N during step 910, the failing block is erased during step 920, and the erase fail bits on the failing page (#N), and two adjacent pages of the same page type (#N−Y and #N+Y) are checked during step 930.

A test is performed during step 940 to determine if the erased fail bits exceeds a predefined double_erase_fail_bits_threshold. If it is determined during step 940 that the erased fail bits does not exceed the predefined double_erase_fail_bits_threshold, then the failing page is not due to a weak erase (step 950).

If it is determined during step 940 that the erased fail bits exceeds the predefined double_erase_fail_bits_threshold, then another erase operation is performed on the failing block during step 960 (e.g., a double erase), and the erase fail bits on the failing page (#N), and two adjacent pages of the same page type (#N−Y and #N+Y) are checked again during step 970.

A test is performed during step 980 to again determine if the erased fail bits exceeds the predefined double_erase_fail_bits_threshold. If it is determined during step 980 that the erased fail bits exceeds the predefined double_erase_fail_bits_threshold, then the failing page is defected during step 990.

If, however, it is determined during step 980 that the erased fail bits does not exceed the predefined double_erase_fail_bits_threshold, then the block can continue to be used during step 995 for the next program operation.

Among other benefits, the exemplary double erase fix mitigation process 900 applies defect mitigation at the page level after a first occurrence of an error, before a subsequent programming, thereby preventing future data loss.

As noted above, the double erase fix mitigation process 900 can optionally perform a permanent mitigation by marking a block to have a double erase on every erase operation (and then every future erase operation uses a double erase).

It is noted that in some embodiments the double_erase_fail_bits_threshold can be determined based on:

(1) a Transfer Function Between EFB and PFB;

(2) an ECC capability; and/or

(3) a required ECC margin.

Generally, if the double_erase_fail_bits_threshold is violated following the second erase operation, then the page is defected during step 990. If, however, the double_erase_fail_bits_threshold is not violated following the second erase operation, then the block can continue to be used for the next program operation during step 995.

In this manner, bad pages in the block are removed and the remainingpages can still be used after applying mitigation.
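By way of illustrative example only, the decision flow of steps 940 through 995 can be sketched in Python as follows. The SimBlock class, the function name, the returned status strings, the use of the maximum erase fail bits across the three checked pages, and the choice to defect every checked page that still exceeds the threshold are all assumptions made for this sketch; they are not taken from the figures. The sketch also assumes that the first erase of the failing block has already been performed before the function is called.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SimBlock:
    """Hypothetical in-memory stand-in for a NAND block (not a real device API)."""
    # efb_per_erase[k] maps page number -> EFB observed after the (k+1)-th erase.
    efb_per_erase: List[Dict[int, int]]
    erase_count: int = 0
    defective_pages: Set[int] = field(default_factory=set)

    def erase(self) -> None:
        self.erase_count += 1

    def erase_fail_bits(self, page: int) -> int:
        # Use the reading for the most recent erase; reuse the last one if exhausted.
        idx = max(0, min(self.erase_count, len(self.efb_per_erase)) - 1)
        return self.efb_per_erase[idx].get(page, 0)


def double_erase_fix(block: SimBlock, n: int, y: int, threshold: int) -> str:
    """Sketch of steps 940-995; assumes the first erase has already been done."""
    pages = (n - y, n, n + y)  # failing page and two adjacent pages of the same type

    # Step 940: compare the EFB after the first erase with the threshold.
    if max(block.erase_fail_bits(p) for p in pages) <= threshold:
        return "not_weak_erase"  # step 950

    block.erase()  # step 960: second erase (double erase)

    # Steps 970/980: re-check the EFB on the same pages.
    still_failing = [p for p in pages if block.erase_fail_bits(p) > threshold]
    if still_failing:
        block.defective_pages.update(still_failing)  # step 990 (assumed page set)
        return "page_defected"
    return "block_usable"  # step 995
```

In this sketch, a block whose erase fail bits drop below the threshold after the second erase is returned to service, while a page that remains above the threshold is recorded as defective and the remaining pages stay usable.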

FIG. 10A illustrates a voltage distribution 1000 associated with an erase verify fix mitigation technique, according to some embodiments. As shown in FIG. 10A, the exemplary erase verify fix mitigation technique uses Test Modes to shift an original erase verify voltage 1020 in a negative direction to an updated erase verify voltage 1010, so that additional erase pulse(s) are applied, causing a deeper erase.

Generally, the exemplary erase verify fix mitigation technique mitigates at the page level after a first occurrence of a read failure, before a subsequent programming of the cells (since the cells are susceptible to weak erase failures), thereby preventing future data loss.

FIG. 10B is a flow chart illustrating an exemplary implementation of an erase verify fix mitigation process 1010, according to one embodiment of the disclosure. As shown in FIG. 10B, the exemplary erase verify fix mitigation process 1010 initially tests to determine if the RAID triggered on LSB page number N (for the exemplary NAND flash device type; or other pages susceptible to weak erase errors in other NAND device types), during step 1015. When the RAID triggers on LSB page number N during step 1015, the failing block is erased during step 1020, and the erase fail bits on the failing page (#N) and on two adjacent pages of the same page type (#N−Y and #N+Y) are checked during step 1030.

A test is performed during step 1040 to determine if the erase fail bits exceed a predefined erase_verify_fail_bits_threshold. If it is determined during step 1040 that the erase fail bits do not exceed the predefined erase_verify_fail_bits_threshold, then the page failure is not due to a weak erase (step 1050).

If it is determined during step 1040 that the erase fail bits exceed the predefined erase_verify_fail_bits_threshold, then the block is programmed with dummy data during step 1060, and the predetermined negative voltage shift is applied to the erase verify voltage during step 1070. The block is then erased during step 1080.

A test is performed during step 1090 to again determine if the erase fail bits exceed the predefined erase_verify_fail_bits_threshold. If it is determined during step 1090 that the erase fail bits exceed the predefined erase_verify_fail_bits_threshold, then the failing page is defected during step 1095.

If, however, it is determined during step 1090 that the erase fail bits do not exceed the predefined erase_verify_fail_bits_threshold, then the block can continue to be used during step 1098 for the next program operation.

Among other benefits, the exemplary erase verify fix mitigation process 1010 applies defect mitigation at the block level after a first occurrence of an error, before a subsequent programming, thereby preventing future data loss.

As noted above, the exemplary erase verify fix mitigation process 1010 can optionally perform a permanent mitigation by marking a block to perform an updated erase verify on every erase operation (and then every future erase operation uses the updated erase verify). It is noted that in some embodiments, the erase_verify_fail_bits_threshold can be the same as or different from the double_erase_fail_bits_threshold, and can be determined during NAND characterization testing.
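By way of illustrative example only, steps 1015 through 1098 can be sketched in Python as follows. The SimBlock model (in which the erase fail bits depend only on the erase verify shift in effect at the last erase), the millivolt units, the function name, the status strings, and the "no_action" result when the RAID did not trigger are assumptions made for this sketch and do not represent a real NAND command set.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SimBlock:
    """Hypothetical block model: EFB depends on the erase verify shift used."""
    # efb_by_shift[shift_mv] maps page number -> EFB after an erase at that shift.
    efb_by_shift: Dict[int, Dict[int, int]]
    verify_shift_mv: int = 0       # shift currently configured (0 = original level)
    last_erase_shift_mv: int = 0   # shift that was in effect at the last erase
    defective_pages: Set[int] = field(default_factory=set)

    def erase(self) -> None:
        self.last_erase_shift_mv = self.verify_shift_mv

    def program_dummy(self) -> None:
        pass  # placeholder: a real device would write dummy data here

    def erase_fail_bits(self, page: int) -> int:
        return self.efb_by_shift.get(self.last_erase_shift_mv, {}).get(page, 0)


def erase_verify_fix(block: SimBlock, raid_triggered: bool, n: int, y: int,
                     threshold: int, shift_mv: int) -> str:
    if shift_mv >= 0:
        raise ValueError("the erase verify shift is expected to be negative")
    if not raid_triggered:                      # step 1015
        return "no_action"                      # assumed outcome, not in the text

    pages = (n - y, n, n + y)
    block.erase()                               # step 1020
    # Steps 1030/1040: check EFB on the failing page and its two neighbors.
    if max(block.erase_fail_bits(p) for p in pages) <= threshold:
        return "not_weak_erase"                 # step 1050

    block.program_dummy()                       # step 1060
    block.verify_shift_mv = shift_mv            # step 1070: negative shift
    block.erase()                               # step 1080: deeper erase

    # Step 1090: re-test against the same threshold.
    still_failing = [p for p in pages if block.erase_fail_bits(p) > threshold]
    if still_failing:
        block.defective_pages.update(still_failing)  # step 1095
        return "page_defected"
    return "block_usable"                       # step 1098
```

The sketch leaves the shifted erase verify level configured on the block after the flow completes, which loosely corresponds to the optional permanent mitigation described above; a transient variant would restore the original level before returning.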

FIG. 11 illustrates voltage distributions 1100, 1150 associated with a page calibration mitigation technique for read failures, for a good erase operation and a weak erase operation, respectively, according to one or more embodiments of the disclosure. As noted above, the exemplary page calibration performs an “R1” read level calibration using a predetermined Vref Sweep Window on LSB page failures. Assume R1 is a read level between an Erase State (L0) and a Program State (L1), as shown in FIG. 11. It has been found that a weak erase condition impacts a page type that uses the R1 read level as one of the read levels during a page read.

Following a good or successful erase operation (associated with the voltage distribution 1100), a reference voltage, Vref, is employed. Following a weak erase condition (associated with the voltage distribution 1150), the exemplary page calibration mitigation technique for read failures performs a page calibration to determine a new calibrated reference voltage, Vref.

Generally, the exemplary page calibration mitigation technique salvages the data following a programming of the cells since the cells are susceptible to weak erase failures.

In some embodiments, the mitigation associated with the page calibration mitigation technique is applied at a page level on every occurrence of an error, thereby permitting data recovery.
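By way of illustrative example only, an R1 read level calibration over a Vref sweep window can be sketched in Python as follows. The sketch assumes that the expected cell states are known so that fail bits can be counted directly; in a device, the fail bit count would more plausibly come from the ECC decoder. The function names, the millivolt units, the tie-breaking rule that prefers the candidate closest to the default Vref, and the sample threshold voltages are all hypothetical.

```python
from typing import Sequence


def count_fail_bits(cell_vt_mv: Sequence[int], expected_erased: Sequence[bool],
                    vref_mv: int) -> int:
    """Count cells whose read state at vref_mv disagrees with the expected state.

    A cell reads as erased (L0) when its threshold voltage is below Vref.
    """
    return sum((vt < vref_mv) != erased
               for vt, erased in zip(cell_vt_mv, expected_erased))


def calibrate_r1(cell_vt_mv: Sequence[int], expected_erased: Sequence[bool],
                 default_vref_mv: int, window_mv: int, step_mv: int) -> int:
    """Sweep Vref across [default - window, default + window] and pick the best."""
    candidates = range(default_vref_mv - window_mv,
                       default_vref_mv + window_mv + 1, step_mv)
    # Fewest fail bits wins; ties go to the candidate closest to the default Vref.
    return min(candidates,
               key=lambda v: (count_fail_bits(cell_vt_mv, expected_erased, v),
                              abs(v - default_vref_mv)))


# Hypothetical weak erase case: the upper tail of the erase state sits above
# the default Vref of 100 mV, so two erased cells are misread as programmed.
vts = [-100, 50, 120, 180, 300, 350, 400]
erased = [True, True, True, True, False, False, False]
new_vref = calibrate_r1(vts, erased, default_vref_mv=100, window_mv=150, step_mv=50)
```

With these sample values, the default Vref misreads two erased cells, and the sweep selects a calibrated Vref of 200 mV at which no cell is misread.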

FIG. 12 illustrates an exemplary technique 1200 for determining erase fail bits thresholds, according to some embodiments. In particular, FIG. 12 illustrates the PFB as a function of the EFB, with a relation, in some embodiments, of:

PFB = m*EFB + C,  (i)

FIG. 13 illustrates exemplary pseudo code 1300 for a process for determining the erase fail bits thresholds in accordance with the techniques of FIG. 12, according to one embodiment of the disclosure. As shown in FIG. 13, the exemplary pseudo code 1300 initially determines the PFB as a function of the EFB, during step 1, for example, during component evaluation. Generally, the component evaluation needs to include “weak erase” samples.

In the example of FIG. 13, the PFB as a function of the EFB is a linear function, according to equation (i):

PFB = m*EFB + C,

where m and C are constants.
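By way of illustrative example only, the constants m and C of equation (i) could be estimated from component evaluation samples with an ordinary least-squares fit, as sketched below in Python. The use of least squares, the function name, and the sample values are assumptions of this sketch; the disclosure does not prescribe a particular fitting method.

```python
from typing import Sequence, Tuple


def fit_pfb_vs_efb(efb: Sequence[float], pfb: Sequence[float]) -> Tuple[float, float]:
    """Return (m, C) such that PFB is approximately m*EFB + C (least squares)."""
    if len(efb) != len(pfb) or len(efb) < 2:
        raise ValueError("need at least two paired (EFB, PFB) samples")
    mean_x = sum(efb) / len(efb)
    mean_y = sum(pfb) / len(pfb)
    sxx = sum((x - mean_x) ** 2 for x in efb)
    if sxx == 0:
        raise ValueError("EFB samples must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(efb, pfb))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical evaluation samples, including "weak erase" samples with high EFB.
m, c = fit_pfb_vs_efb([10, 20, 40], [23, 43, 83])
```

The sample points above lie exactly on a line, so the fit recovers a slope of 2 and an intercept of 3 to within floating-point rounding.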

During step 2.A, an acceptable fail bits PFB_Factory threshold is established, for example, based on a required infant mortality bit error rate (BER) and applicable ECC.

During step 2.B, an acceptable fail bits PFB_Retirement threshold is established, for example, based on an applicable ECC during flash life.

During step 3, EFB_Factory and EFB_Retirement thresholds are determined by substituting the results of steps 2.A and 2.B in equation (i).
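By way of illustrative example only, the substitution of step 3 amounts to solving equation (i) for EFB, i.e., EFB = (PFB − C)/m. The Python sketch below assumes a positive slope m and rounds the result down to a whole number of bits; the function name, the rounding choice, and the numeric values are hypothetical.

```python
import math


def efb_threshold(pfb_limit: float, m: float, c: float) -> int:
    """Solve PFB = m*EFB + C for EFB at the acceptable PFB limit."""
    if m <= 0:
        raise ValueError("this sketch assumes PFB increases with EFB (m > 0)")
    # Round down so that the predicted PFB never exceeds the acceptable limit.
    return math.floor((pfb_limit - c) / m)


# Hypothetical constants from step 1 and acceptable PFB limits from steps 2.A/2.B.
M, C = 2.0, 3.0
efb_factory = efb_threshold(pfb_limit=43, m=M, c=C)
efb_retirement = efb_threshold(pfb_limit=83, m=M, c=C)
```

With these hypothetical inputs, the factory and retirement EFB thresholds evaluate to 20 and 40 bits, respectively.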

It is noted that the Double_Erase_Fail_Bits_Threshold can be the same as or lower than the Page_Defect_Erase_Fail_Bits_Threshold. In addition, the above example of FIG. 12 shows a linear function between EFB and PFB, but a transfer function between EFB and PFB could be linear or non-linear, as would be apparent to a person of ordinary skill in the art.

If no correlation is found between EFB and PFB, the thresholds can be determined based on the EFB observed at weak erase locations that result in UECC. Finally, it is noted that a different method to determine the transfer function between two parameters is beyond the scope of the present disclosure.

In one or more embodiments of the disclosure, techniques are provided for weak erase detection and mitigation. It should be understood that the weak erase detection and mitigation techniques illustrated in FIGS. 1 through 13 are presented by way of illustrative example only, and should not be construed as limiting in any way. Numerous alternative configurations of system and device elements and associated processing operations can be used in other embodiments.

In at least some embodiments, the disclosed weak erase detections are performed only on weak erase impacted pages and/or page types (where the page type corresponds to a read level between an Erase state and an adjacent Program state) for a quicker detection and mitigation.

Illustrative embodiments disclosed herein can provide a number of significant advantages relative to conventional arrangements. For example, one or more embodiments provide a significantly improved mitigation for weak erase failures, including defecting or retiring blocks/pages or making the erase distribution deeper, to prevent read failures followed by a number of recovery steps.

It is to be appreciated that the particular advantages described above and elsewhere herein are associated with particular illustrative embodiments and need not be present in other embodiments. Also, the particular types of weak erase detection and mitigation features and functionality as illustrated in the drawings and described above are exemplary only, and numerous other arrangements may be used in other embodiments.

As mentioned previously, at least portions of the disclosed weak erase detection and mitigation system may be implemented using one or more processing platforms. A given such processing platform comprises at least one processing device comprising a processor coupled to a memory. The processor and memory in some embodiments comprise respective processor and memory elements of a virtual machine or container provided using one or more underlying physical machines. The term “processing device” as used herein is intended to be broadly construed so as to encompass a wide variety of different arrangements of physical processors, memories and other device components as well as virtual instances of such components. For example, a “processing device” in some embodiments can comprise or be executed across one or more virtual processors. Processing devices can therefore be physical or virtual and can be executed across one or more physical or virtual processors. It should also be noted that a given virtual device can be mapped to a portion of a physical one.

The disclosed weak erase detection and mitigation arrangements may be implemented using one or more processing platforms. One or more of the processing modules or other components may therefore each run on a computer, storage device or other processing platform element. A given such element may be viewed as an example of what is more generally referred to herein as a “processing device.”

Referring now to FIG. 14, one possible processing platform that may be used to implement at least a portion of one or more embodiments of the disclosure is shown. The processing platform 1400 in this embodiment comprises at least a portion of the given system and includes one or more processing devices, denoted 1402-1, 1402-2, 1402-3, . . . 1402-D, which communicate with one another over a network 1404. The network 1404 may comprise any type of network, such as the Internet, a wide area network (WAN), a local area network (LAN), a satellite network, a telephone or cable network, a cellular network, a wireless network such as WiFi or WiMAX, or various portions or combinations of these and other types of networks.

The processing device 1402-1 in the processing platform 1400 comprises a processor 1410 coupled to a memory 1412. The processor 1410 may comprise a microprocessor, a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other type of processing circuitry, as well as portions or combinations of such circuitry elements. The memory 1412 may comprise random access memory (RAM), read only memory (ROM) or other types of memory, in any combination. The memory 1412 and other memories disclosed herein should be viewed as illustrative examples of what are more generally referred to as “processor-readable storage media” storing executable program code of one or more software programs.

Also included in the processing device 1402-1 is network interface circuitry 1414, which is used to interface the processing device with the network 1404 and other system components, and may comprise conventional transceivers.

The other processing devices 1402, if any, of the processing platform 1400 are assumed to be configured in a manner similar to that shown for processing device 1402-1 in the figure.

Again, the particular processing platform 1400 shown in the figure is presented by way of example only, and the given system may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, storage devices or other processing devices.

Multiple elements of the system may be collectively implemented on a common processing platform of the type shown in FIG. 14, or each such element may be implemented on a separate processing platform.

Articles of manufacture comprising such processor-readable storage media are considered illustrative embodiments. A given such article of manufacture may comprise, for example, a storage array, a storage disk or an integrated circuit containing RAM, ROM or other electronic memory, or any of a wide variety of other types of computer program products. The term “article of manufacture” as used herein should be understood to exclude transitory, propagating signals. Numerous other types of computer program products comprising processor-readable storage media can be used.

Again, the particular processing platform 1400 shown in FIG. 14 is presented by way of example only, and the weak erase detection and mitigation system may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, servers, storage devices or other processing devices.

It should therefore be understood that in other embodiments different arrangements of additional or alternative elements may be used. At least a subset of these elements may be collectively implemented on a common processing platform, or each such element may be implemented on a separate processing platform.

Also, numerous other arrangements of computers, servers, storage devices or other components are possible in the weak erase detection and mitigation system. Such components can communicate with other elements of the weak erase detection and mitigation system over any type of network or other communication media.

As indicated previously, components of an information processing system as disclosed herein can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device. For example, at least portions of the functionality of the processes of FIGS. 5-13 are illustratively implemented in the form of software running on one or more processing devices.

It should again be emphasized that the above-described embodiments are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the disclosed techniques are applicable to a wide variety of other types of information processing systems and weak erase detection and mitigation systems. Also, the particular configurations of system and device elements and associated processing operations illustratively shown in the drawings can be varied in other embodiments. Moreover, the various assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the disclosure. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.

What is claimed is:
1. A method of detecting weak erase failure in a storage media of a solid-state storage device, the method comprising steps of: prior to a program operation of a portion of the storage media or after a read failure of the portion requiring an outer code, determining, by a controller of the solid-state storage device, an erase fail bits metric for the portion of the storage media; detecting, by the controller, a permanent failure in the portion of the storage media of the solid-state storage device causing weak erase failure mode by determining that the erase fail bits metric for the portion of the storage media exceeds a predefined erase fail bits threshold; and upon detecting the permanent failure in the portion of the storage media of the solid-state storage device causing the weak erase failure mode, mitigating, by the controller, for the weak erase failure mode in the portion of the storage media of the solid-state storage device by performing one or more of: defecting one or more blocks of the solid-state storage device when the erase fail bits metric exceeds a predefined factory erase fail bits threshold, the predefined factory erase fail bits threshold being based on a factory screening criteria, marking one or more pages of the solid-state storage device as defective when the erase fail bits metric exceeds a predefined page defect erase fail bits threshold, the predefined page defect erase fail bits threshold being based on a retirement policy for defective pages, performing two erase operations on one or more blocks of the solid-state storage device when the erase fail bits metric exceeds a predefined double erase fail bits threshold, the predefined double erase fail bits threshold being based on an error correction code capability criteria, shifting an erase verify voltage in a predefined direction to apply one or more additional erase pulses to one or more impacted blocks of the solid-state storage device, and performing a read voltage calibration for a read level between an erase state and adjacent program state of the solid-state storage device using a predefined reference voltage sweep for one or more failures on one or more pages impacted by the one or more failures.
2. The method of claim 1, wherein mitigating for the weak erase failure mode further comprises changing a status of one or more cells of the solid-state storage device to one or more of a defective state and a retired state at a page level resolution.
3. The method of claim 1, wherein detecting the permanent failure in the solid-state storage device causing the weak erase failure mode comprises detecting the weak erase failure mode without an erase failure.
4. The method of claim 1, wherein one or more of the predefined factory erase fail bits threshold, the predefined page defect erase fail bits threshold, and the predefined double erase fail bits threshold are obtained using a correlation between the erase fail bits metric and a programmed fail bits metric.
5. The method of claim 1, wherein the predefined factory erase fail bits threshold evaluates one or more of an infant mortality bit error rate and an applicable error correction code.
6. The method of claim 1, wherein the predefined page defect erase fail bits threshold evaluates an applicable error correction code during a life of the solid-state storage device.
7. The method of claim 1, wherein the predefined double erase fail bits threshold evaluates one or more of a correlation between the erase fail bits metric and a programmed fail bits metric and a required error correction code margin.
8. A tangible machine-readable recordable storage medium, containing one or more software programs that, when executed by the controller of the solid-state storage device, implement the steps of the method of claim 1.
9. A controller of a storage device, comprising: a memory storing a weak erase detection and mitigation process; and at least one processor coupled to the memory and to a solid-state storage media of the storage device, the weak erase detection and mitigation process operative to cause the at least one processor to: prior to a program operation of a portion of the solid-state storage media or after a read failure of the portion requiring an outer code, determine an erase fail bits metric for the portion of the solid-state storage media, detect a permanent failure in the portion of the solid-state storage media causing weak erase failure mode by determining that the erase fail bits metric for the portion of the solid-state storage media exceeds a predefined erase fail bits threshold, and upon detecting the permanent failure in the portion of the solid-state storage media causing weak erase failure mode, mitigate for the weak erase failure mode in the portion of the solid-state storage media; wherein mitigating for the weak erase failure mode in the portion of the solid-state storage media comprises one or more of: defecting one or more blocks of the solid-state storage media when the erase fail bits metric exceeds a predefined factory erase fail bits threshold, the predefined factory erase fail bits threshold being based on a factory screening criteria, marking one or more pages of the solid-state storage media as defective when the erase fail bits metric exceeds a predefined page defect erase fail bits threshold, the predefined page defect erase fail bits threshold being based on a retirement policy for defective pages, performing two erase operations on one or more blocks of the solid-state storage media when the erase fail bits metric exceeds a predefined double erase fail bits threshold, the predefined double erase fail bits threshold being based on an error correction code capability criteria, shifting an erase verify voltage in a predefined direction to apply one or more additional erase pulses to one or more impacted blocks of the solid-state storage media, and performing a read voltage calibration for a read level between an erase state and adjacent program state of the solid-state storage media using a predefined reference voltage sweep for one or more failures on one or more pages impacted by the one or more failures.
10. The controller of claim 9, wherein mitigating for the weak erase failure mode comprises changing a status of one or more cells of the solid-state storage media to one or more of a defective state and a retired state at a page level resolution.
11. The controller of claim 9, wherein the detecting the permanent failure in the portion of the solid-state storage media causing the weak erase failure mode comprises detecting the weak erase failure mode without an erase failure.
12. A solid-state storage device comprising: a non-volatile storage media; and a controller operatively connected to the storage media and comprising a memory and a processor, the controller configured to perform the following steps: prior to a program operation of a portion of the storage media or after a read failure of the portion requiring an outer code, determine an erase fail bits metric for the portion of the storage media, detect a permanent failure in the portion of the storage media causing weak erase failures by determining that the erase fail bits metric for the portion exceeds a predefined erase fail bits threshold, and upon detecting the permanent failure in the portion of the storage media causing the weak erase failures, mitigate effects of the weak erase failures in the portion of the storage media by performing one or more of: changing a status of one or more cells of the solid-state storage device to one or more of a defective state and a retired state at a page level resolution, performing two erase operations on one or more blocks of the storage media, and performing an erase operation while shifting an erase verify voltage in a predefined direction to apply one or more additional erase pulses to one or more impacted blocks of the storage media.